Welcome back!

Updates

  • Exam for exchange students: 21.12.2023 at 16:15 in room 01-013.
  • Materials on text analysis and image analysis (lecture 6) is for self study
  • The mock exam is online 💣
    • Central exam from last year
    • Solutions won’t be discussed in the lecture
    • Use the forum on Canvas to discuss solutions
    • No forum supervision/email during the learning phase guaranteed (from me or the TAs)

Part II: Data gathering and preparation

Date Topic
16.11.2023 Data preparation and manipulation
23.11.2023 Basic statistics and data analysis with R
23.11.2023 Exercises/Workshop 4: Data gathering, data import
30.11.2023 Guest Lecture: Matteo Courthoud (Senior Economist and Data Scientist @Zalando)

Part III: Analysis, visualisation, output

Date Topic
07.12.2023 Visualisation, dynamic documents
07.12.2023 Exercises/Workshop 5: Data preparation and applied data analysis with R
14.12.2023 Guest Lecture: Florian Chatagny (Head of Data Science @Federal Finance Administration in Bern)
21.12.2023 Exercises/Workshop 6: Visualization, dynamic documents
21.12.2023 Summary, Wrap-Up, Q&A, Feedback
21.12.2023 Exam for Exchange Students

Summary and warm up

Summary

Data

Rectangular data

  • Import data from text files, csv, tsv, etc.
  • Tibbles, data frames in \(R\)


Non-rectangular data

  • Hierarchical data (xml, html, json)
  • Unstructured text data
  • Images/Pictures data

A Template/Blueprint

Tell your future self what this script is all about

#######################################################################
# Project XY: Data Gathering and Import
#
# This script is the first part of the data pipeline of project XY.
# It imports data from ...
# Input: links to data sources (data comes in ... format)
# Output: cleaned data as CSV
# 
# A. Sallin, St. Gallen, 2023
#######################################################################


# SET UP --------------
# load packages
library(tidyverse)

# set fix variables
INPUT_PATH <- "/rawdata"
OUTPUT_FILE <- "/final_data/datafile.csv"


# IMPORT RAW DATA FROM CSVs -------------

Warm up

JSON files: open-ended question

Be the JSON file

{
  "students": [
    {
      "id": 19091,
      "firstName": "Peter",
      "lastName": "Mueller",
      "grades": {
          "micro": 5,
          "macro": 4.5,
          "data handling": 5.5
          }
    },
    {
      "id": 19092,
      "firstName": "Anna",
      "lastName": "Schmid",
      "grades": {
          "micro": 5.25,
          "macro": 4,
          "data handling": 5.75
          }
    },
    {
      "id": 19093,
      "firstName": "Noah",
      "lastName": "Trevor",
      "grades": {
          "micro": 4,
          "macro": 4.5,
          "data handling": 5
          }
    }
  ]
}

Write an R code to extract a table with, as a first column, a vector of first names, and as a second column, the average grade per student. The table can be a data frame or a tibble.

XML

<students>
  <student>
    <id>19091</id>
    <firstName>Peter</firstName>
    <lastName>Mueller</lastName>
    <grades>
      <micro>5</micro>
      <macro>4.5</macro>
      <dataHandling>5.5</dataHandling>
    </grades>
  </student>
  <student>
    <id>19092</id>
    <firstName>Anna</firstName>
    <lastName>Schmid</lastName>
    <grades>
      <micro>5.25</micro>
      <macro>4</macro>
      <dataHandling>5.75</dataHandling>
    </grades>
  </student>
  <student>
    <id>19093</id>
    <firstName>Noah</firstName>
    <lastName>Trevor</lastName>
    <grades>
      <micro>4</micro>
      <macro>4.5</macro>
      <dataHandling>5</dataHandling>
    </grades>
  </student>
</students>
  • ‘students’ is the root-node, ‘grades’ are its children
  • the siblings of Trevor Noah are Anna Schmid and Peter Mueller
  • The code below would be an alternative, equivalent notation for the third student in the xml file above.
<student id="19093" firstName="Noah" lastName="Trevor">
      <grades micro="4" macro="4.5" dataHandling="5" />
</student>

Part II: Data gathering and preparation

Part II: Data gathering and preparation

Part II: Data gathering and preparation

Goals for today

Goals for today: cognitive goals

  • Recognize where the problems are in a given dataset, and what is in the way of a proper analysis of the data.
  • Organize your work: what needs to be addressed first?

Goals for today: skills

  • Use simple string-operations to clean text variables.
  • Reshape datasets from wide to long (and vice versa).
  • Apply row-binding/stacking of datasets

Data Preparation

The dataset is imported, now what?

  • In practice: still a long way to go.
  • Parsable, but messy data: Inconsistencies, data types, missing observations, wide format.

The dataset is imported, now what?

  • In practice: still a long way to go.
  • Parsable, but messy data: Inconsistencies, data types, missing observations, wide format.


  • Goal of data preparation: dataset is ready for analysis.
  • Key conditions:
    1. Data values are consistent/clean within each variable.
    2. Variables are of proper data types.
    3. Dataset is in ‘tidy’ (in long format, more on this after the break)!

“Garbage in garbage out” principle

Move to Nuvolos

Data preparation: three concepts

  • Tidy data
  • Reshaping
  • Stacking

Tidy data: some vocabulary

Following Wickham (2014):

  • Dataset: Collection of values (numbers and strings).
  • Every value belongs to a variable and an observation.
  • Variable: Contains all values that measure the same underlying attribute across units.
  • Observation: Contains all values measured on the same unit (e.g., a person).

Tidy data

Three examples of non-tidy data (1)

Not tidy:

##       measure Jan.1 Jan.2 Jan.3
## 1 Temperature    20    22    21
## 2    Humidity    80    78    82


Tidy:

Three examples of non-tidy data (1)

Not tidy 💩

##       measure Jan.1 Jan.2 Jan.3
## 1 Temperature    20    22    21
## 2    Humidity    80    78    82


Tidy 😎

## # A tibble: 3 × 3
##   Date  Temperature Humidity
##   <chr>       <dbl>    <dbl>
## 1 Jan.1          20       80
## 2 Jan.2          22       78
## 3 Jan.3          21       82

Three examples of non-tidy data (2)

Not tidy:

##   temperature_location
## 1           22C_London
## 2            18C_Paris
## 3             25C_Rome


Tidy:

homework..

Three examples of non-tidy data (3)

Not tidy:

##    Student Econ DataHandling Management
## 1 Johannes 5.00          4.0        5.5
## 2   Hannah 5.25          4.5        6.0
## 3     Igor 4.00          5.0        6.0


Tidy:

homework..

Reshaping: the concept

Stack/row-bind: the concept

Stack/row-bind: implementation in R

  • Use rbind() in base R
    • Requires that the data frames have the same column names and same column classes.


  • Use bind_rows() from dplyr()
    • More flexible.
    • Binds data frames with different column names and classes
    • Automatically fills missing columns with NA


For these reasons (+ performance, handling or row names, and handling of factors), dplyr::bind_rows() is preferred.

Move to Nuvolos

Summary

Reshaping: summary

Reshaping: summary

Q&A

References